
Make podRef unique #433

Closed
wants to merge 3 commits into from

Conversation

pliurh

@pliurh pliurh commented Mar 4, 2024

We used a podRef with the format <namespace>/<pod_name> to identify each IP allocation. However, in scenarios like a ReplicaSet, a different pod can reuse the same pod name. This patch changes the podRef format to <namespace>/<pod_name>:<pod_uid> so that each IP allocation has a unique identifier. The legacy podRef format is updated automatically during an upgrade.
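For reference, the new identifier is simply the namespace, name, and UID joined together; a minimal sketch (the composePodRef helper and its placement are illustrative, not necessarily this PR's code):

package sketch

import (
    "fmt"

    corev1 "k8s.io/api/core/v1"
)

// composePodRef builds the new podRef format, <namespace>/<pod_name>:<pod_uid>.
// The helper name is hypothetical; the patch may construct the string elsewhere.
func composePodRef(pod *corev1.Pod) string {
    return fmt.Sprintf("%s/%s:%s", pod.GetNamespace(), pod.GetName(), pod.GetUID())
}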

Sometimes, we want to run the reconciler outside of the cluster
for development convenience. This patch allows the reconciler
process to use the KUBECONFIG environment variable.

Signed-off-by: Peng Liu <[email protected]>
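As a rough sketch of the KUBECONFIG behaviour described above (the helper name and exact fallback order are illustrative; the real change lives in the reconciler's client setup):

package sketch

import (
    "os"

    "k8s.io/client-go/rest"
    "k8s.io/client-go/tools/clientcmd"
)

// buildConfig prefers the in-cluster service-account config and falls back to
// the file referenced by the KUBECONFIG environment variable, which is what
// makes running the reconciler outside the cluster possible.
func buildConfig() (*rest.Config, error) {
    if cfg, err := rest.InClusterConfig(); err == nil {
        return cfg, nil
    }
    kubeConfigFile := os.Getenv("KUBECONFIG")
    return clientcmd.BuildConfigFromFlags("", kubeConfigFile)
}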
@pliurh
Author

pliurh commented Mar 4, 2024

fix #176

@maiqueb
Collaborator

maiqueb commented Mar 4, 2024

Why have you chosen that format - <namespace>/<pod_name>:<pod_uid>?

If you just need the pod ref to be unique, shouldn't we simply use the pod's UID ?

Is the name part meant to help debugging or something like that ?

@pliurh
Author

pliurh commented Mar 4, 2024

The new naming format shall be namespace/pod_name:pod_uid. And yes, it's meant for debugging.

Member

@dougbtv dougbtv left a comment

This looks good to me. I'd love to have another review on it before a merge, but overall this approach makes sense. Thank you Peng!

cfg, err = clientcmd.BuildConfigFromFlags("", kubeConfigFile)
if err != nil {
    return nil, fmt.Errorf("failed to generate kubeconfig from file %s: %v", kubeConfigFile, err)
}
Member

thanks for the improved logging on error

}

// migrate the podRef format if needed
if err := looper.migrationPodRef(ctx, ipPools); err != nil {
Member

Ok, I think I get how it triggers the migration: so here in the control loop, we check whether the migration has happened and apply it here.
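In rough terms, the check that decides whether an allocation still needs migrating could look like this (a simplified sketch; the storage import path and the IPPool/Allocations shape follow the diff hunks in this PR, while legacyPodRefs itself is hypothetical and the real migrationPodRef also rewrites and persists the legacy refs):

package sketch

import (
    "regexp"

    "github.com/k8snetworkplumbingwg/whereabouts/pkg/storage"
)

// podRefPattern matches the new <namespace>/<pod_name>:<pod_uid> format.
const podRefPattern = `^([^\s/]+)/([^\s/]+):([^\s/]+)$`

// legacyPodRefs collects allocations whose podRef is still in the legacy
// <namespace>/<pod_name> format; these are the ones the migration rewrites.
func legacyPodRefs(pools []storage.IPPool) []string {
    re := regexp.MustCompile(podRefPattern)
    var legacy []string
    for _, pool := range pools {
        for _, alloc := range pool.Allocations() {
            if !re.MatchString(alloc.PodRef) {
                legacy = append(legacy, alloc.PodRef)
            }
        }
    }
    return legacy
}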

k8sClient: *k8sClient,
liveWhereaboutsPods: indexPods(pods, whereaboutsPodRefs),
requestTimeout: timeout,
pods, err := k8sClient.ListPods(ctx)
Member

Can we limit this to just pods with */networks annotations maybe? Unsure if this is premature optimization?

Something like:

    // Define label selector to filter pods with specific annotations
    labelSelector := metav1.LabelSelector{
        MatchExpressions: []metav1.LabelSelectorRequirement{
            {
                Key:      "k8s.v1.cni.cncf.io/networks",
                Operator: metav1.LabelSelectorOpExists,
            },
        },
    }

    // List pods matching the label selector
    pods, err := k8sClient.CoreV1().Pods("").List(ctx, metav1.ListOptions{
        LabelSelector: metav1.FormatLabelSelector(&labelSelector),
    })

Member

Sorry I had this wrong, this is abstracted. It's fine as is. Thanks!

fakewbclient "github.com/k8snetworkplumbingwg/whereabouts/pkg/client/clientset/versioned/fake"
)

var _ = FDescribe("MigrationPodRef", func() {
Member

Thanks for the migration unit test.

Collaborator

You need to remove the focus from these tests; otherwise, you'll hijack the CI into only running these tests.
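For anyone unfamiliar with Ginkgo's focus semantics: an FDescribe container marks the suite as focused, so only the focused specs run; dropping the F prefix restores normal behaviour. A sketch (the package name and Ginkgo major version are assumptions):

package controlloop_test

import (
    . "github.com/onsi/ginkgo/v2" // adjust to v1 if that is what the repo vendors
)

// Focused: while this exists, CI runs only the focused specs.
var _ = FDescribe("MigrationPodRef (focused)", func() {})

// Fix: an unfocused container runs together with the rest of the suite.
var _ = Describe("MigrationPodRef", func() {})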

Author

Good catch. I forgot it.

@maiqueb
Collaborator

maiqueb commented Mar 4, 2024

@dougbtv please allow me some time to review this PR.

@pliurh you need to sort out the e2e and the unit tests.

I'll keep peeking to see when they're good.

@pliurh pliurh force-pushed the podref branch 3 times, most recently from 9fa1805 to 037d68b Compare March 6, 2024 04:33
@pliurh
Author

pliurh commented Mar 6, 2024

/hold

@pliurh pliurh force-pushed the podref branch 2 times, most recently from 3a84167 to 84bcac4 Compare March 7, 2024 03:49
@pliurh
Author

pliurh commented Mar 7, 2024

/unhold
@maiqueb @dougbtv I revised the PR. Please take another look.

@pliurh pliurh force-pushed the podref branch 6 times, most recently from 5e0d6ab to b823fee Compare March 11, 2024 07:50
We used podRef with the format <namespace>/<pod_name> to identify the IP
allocation. However, in scenarios like a ReplicaSet, a different
pod can reuse the pod name. This patch changes the format of podRef
to <namespace>/<pod_name>:<pod_uid> to ensure each IP allocation
has a unique identifier. The legacy podRef format will be
updated automatically during an upgrade.

Signed-off-by: Peng Liu <[email protected]>
The containerID of a pod will change after a node reboot. This caused
the IP allocation/deallocation operations to be unable to find the IP
that had been reserved by the pod.

Signed-off-by: Peng Liu <[email protected]>
if err != nil {
return nil, fmt.Errorf("failed to implicitly generate the kubeconfig: %w", err)
logging.Debugf("failed to generate the kubeconfig from service account: %v", err)
kubeConfigFile := os.Getenv("KUBECONFIG")
Contributor

Both newPodController and ReconcileIPs are exclusively called from controlloop.go:main().
I know that the operation doesn't cost anything, but it rubs me the wrong way to do kubeConfigFile := os.Getenv("KUBECONFIG") in each of them; the correct place, imo, is controlloop.go:main(), which can then pass kubeConfigFile as an argument to both newPodController and ReconcileIPs. Perhaps that even allows for more optimizations.

cfg, err := rest.InClusterConfig()
var cfg *rest.Config
var err error
cfg, err = rest.InClusterConfig()
Contributor

@andreaskaris andreaskaris Mar 20, 2024

If this is run inside the cluster, but the KUBECONFIG env variable is set:

  • ip.go will use NewReconcileLooperWithKubeconfig due to the if condition
  • newPodController will use rest.InClusterConfig, because in-cluster rest.InClusterConfig will succeed

So your code can end up using the in-cluster config for one, and the kubeconfig for the other.
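One way to avoid that split (and to read KUBECONFIG in a single place, as suggested above) is a shared helper that both newPodController and the reconcile looper call, so the precedence is decided exactly once. A sketch, with the precedence itself being a design choice rather than what this PR currently does:

package sketch

import (
    "os"

    "k8s.io/client-go/rest"
    "k8s.io/client-go/tools/clientcmd"
)

// sharedClientConfig resolves the REST config in one place: an explicitly set
// KUBECONFIG wins, otherwise the in-cluster service-account config is used.
// With both callers going through this helper, they can never diverge.
func sharedClientConfig() (*rest.Config, error) {
    if kubeConfigFile := os.Getenv("KUBECONFIG"); kubeConfigFile != "" {
        return clientcmd.BuildConfigFromFlags("", kubeConfigFile)
    }
    return rest.InClusterConfig()
}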

@@ -39,6 +39,7 @@ const (
cronSchedulerCreationError
fileWatcherError
couldNotCreateConfigWatcherError
initialReconsileFailed
Contributor

nit: initialReconsileFailed -> initialReconcileFailed

// trigger one immediate reconcile before cron job start
go reconciler.ReconcileIPs(errorChan)
err = <-errorChan
if err == nil {
Contributor

nit:

if err = <-errorChan; err != nil {
	logging.Verbosef("initial reconciler failure: %s", err)
	os.Exit(initialReconcileFailed)
}
logging.Verbosef("initial reconciler success")

(see https://go.dev/play/p/ltsR09kXKW9)

return logging.Errorf("failed to list overlappingrangeipreservations %v", err)
}
for _, ip := range reservations {
if regexp.MustCompile(pattern).MatchString(ip.Spec.PodRef) {
Contributor

nit - the regexp doesn't have to be compiled on every loop iteration:

const (
    podRefPattern = `^([^\s/]+)/([^\s/]+):([^\s/]+)$`
)

(...)

func (rl ReconcileLooper) migrationPodRef(ctx context.Context, ipPools []storage.IPPool) error {
    re := regexp.MustCompile(podRefPattern)
(...)
    for _, ip := range reservations {
		if re.MatchString(ip.Spec.PodRef) {

ipReservations = nil
needUpdate := false
for _, r := range pool.Allocations() {
if !regexp.MustCompile(pattern).MatchString(r.PodRef) {
Contributor

ditto

Expect(err).NotTo(HaveOccurred())

found := false
pattern := `^([^\s/]+)/([^\s/]+):([^\s/]+)$`
Contributor

nit: reuse podRefPattern from iploop.go once that const is created, instead of redefining the pattern (less error prone):

const (
    podRefPattern = `^([^\s/]+)/([^\s/]+):([^\s/]+)$`
)

Expect(err).NotTo(HaveOccurred())
for _, pool := range ipPools {
for _, r := range pool.Allocations() {
if !regexp.MustCompile(pattern).MatchString(r.PodRef) {
Contributor

@andreaskaris andreaskaris Mar 20, 2024

nit: ditto as earlier, compile outside of the loop instead of compiling it every single time

ips, err := reconcileLooper.k8sClient.ListOverlappingIPs(ctx)
Expect(err).NotTo(HaveOccurred())
for _, ip := range ips {
if !regexp.MustCompile(pattern).MatchString(ip.Spec.PodRef) {
Contributor

ditto

return nil, logging.Errorf("Error marshaling overlappingrangeipreservations.whereabouts.cni.cncf.io spec: %v", err)
}

return i.client.WhereaboutsV1alpha1().OverlappingRangeIPReservations(clusterWideIP.GetNamespace()).Patch(
Contributor

question: does this have to be Patch, don't we have an Update() for that? (I might be completely wrong)

Author

The IP reconciler runs on multiple nodes, so there would be race conditions if we used Update() in this case. However, using Patch() could cause too many API requests. We may need leader election here.

I agree that we need to add a retry mechanism. This is a general problem in the IP reconciler: it doesn't retry when reconciliation fails.
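For context, the usual client-go way to make Update() safe against those races is an optimistic-concurrency retry. A sketch using retry.RetryOnConflict, where getReservation and updateReservation are placeholders for a get-and-update of the OverlappingRangeIPReservation through the generated clientset (not this PR's code):

package sketch

import (
    "k8s.io/client-go/util/retry"
)

// Placeholders standing in for reading and updating the reservation object.
func getReservation() error                    { return nil }
func updateReservation(newPodRef string) error { return nil }

// updatePodRefWithRetry re-reads and re-applies the change whenever the
// apiserver reports a resourceVersion conflict, which is how Update() is
// normally made safe when several reconcilers race on the same object.
func updatePodRefWithRetry(newPodRef string) error {
    return retry.RetryOnConflict(retry.DefaultRetry, func() error {
        if err := getReservation(); err != nil {
            return err
        }
        return updateReservation(newPodRef)
    })
}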

@andreaskaris
Contributor

andreaskaris commented Mar 22, 2024

I've got a question ... will this actually work with StatefulSets and guarantee that the same IP address is always assigned to the same StatefulSet pod? Because the uid of a StatefulSet pod changes when it is recreated:

[akaris@workstation ~]$ oc get pods -o yaml | grep uid
      uid: b5a56769-43e0-4d8c-85bd-8a5f105986fb
    uid: 4ecd9b66-976e-4be3-84fa-cfe7939abfa9
[akaris@workstation ~]$ oc get pods -o yaml | grep uid
      uid: b5a56769-43e0-4d8c-85bd-8a5f105986fb
    uid: 4ecd9b66-976e-4be3-84fa-cfe7939abfa9
[akaris@workstation ~]$ oc delete pod tools-ipv4-0
pod "tools-ipv4-0" deleted
[akaris@workstation ~]$ 
[akaris@workstation ~]$ oc get pods
NAME           READY   STATUS              RESTARTS   AGE
tools-ipv4-0   0/1     ContainerCreating   0          3s
[akaris@workstation ~]$ oc get pods -o yaml | grep uid
      uid: b5a56769-43e0-4d8c-85bd-8a5f105986fb
    uid: f4dc518a-da66-4bad-832b-32394ea08299
[akaris@workstation ~]$ 
[akaris@workstation ~]$ 
[akaris@workstation ~]$ oc get pods -o yaml | grep uid
      uid: b5a56769-43e0-4d8c-85bd-8a5f105986fb
    uid: f4dc518a-da66-4bad-832b-32394ea08299
[akaris@workstation ~]$ 
[akaris@workstation ~]$ 
[akaris@workstation ~]$ oc get pods -o yaml | grep uid
      uid: b5a56769-43e0-4d8c-85bd-8a5f105986fb
    uid: f4dc518a-da66-4bad-832b-32394ea08299

I'm trying to understand when and why it would ever be a problem that an allocation is being reused (in the case of Statefulsets, isn't it even desired that an allocation is reused?) when the combination of namespace/name is actually a unique identifier for a pod https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/

.. well, just looking at this right now in an SNO cluster that I have, the IP address of my single StatefulSet pod actually changed after a reboot ..


To clarify my point:

to ensure each ip allocation will have a unique identifier.

The commit's comment is misleading, because the identifier <namespace>/<name> is always guaranteed to be unique by Kubernetes:
https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces/. That's used for e.g. StatefulSets, where the pod uid is not the same, but the pod after recreation is "the same" for a set of specific attributes.
So yes, the identity is not the same ... but as an admin, I personally would expect key attributes to be the same, and I'd say I'd expect whereabouts to assign the same IP address! So, if pod uid-a goes down, and pod uid-b comes up with the same <namespace>/<name>, ideally whereabouts would assign the exact same IP address; at the very least, it shouldn't matter if it assigned the same IP.

In Kubernetes, there will never be two pods with the same <namespace>/<name> at the same time. So I would argue (not looking at the current implementation) that the allocator should look at the current allocations and, if an allocation for <namespace>/<pod> already exists, return that IP address and maintain the allocation, instead of trying to de-allocate and then re-allocate a new IP; wouldn't that likewise fix https://issues.redhat.com/browse/OCPBUGS-29648 and be easier / cleaner?
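What that would look like at the allocation step, roughly (the IPReservation shape mirrors whereabouts' allocation entries, but findExisting itself is a hypothetical illustration of the proposal, not existing code):

package sketch

import "net"

// IPReservation mirrors the relevant fields of a whereabouts allocation entry.
type IPReservation struct {
    IP     net.IP
    PodRef string
}

// findExisting returns the already-reserved IP for a given <namespace>/<pod_name>,
// if any, so a recreated StatefulSet pod keeps its address instead of the
// allocation being released and re-made.
func findExisting(reservations []IPReservation, podRef string) (net.IP, bool) {
    for _, r := range reservations {
        if r.PodRef == podRef {
            return r.IP, true
        }
    }
    return nil, false
}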

@pliurh
Author

pliurh commented Mar 25, 2024

@andreaskaris I think you made a great point. Instead of addressing it from the IP reconciler's perspective, resolving it at the allocation stage would be more effective.

@pliurh pliurh closed this Mar 25, 2024
@pliurh
Author

pliurh commented Mar 27, 2024

replaced by #446
